{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Intro to Python and Web Scraping\n",
    "\n",
    "## Info\n",
    "- Scott Bailey (CIDR), *scottbailey@stanford.edu*\n",
    "- Javier de la Rosa (CIDR), *versae@stanford.edu*\n",
    "- Ashley Jester (CIDR/SSDS), *ajester@stanford.edu*\n",
    "\n",
    "## Goal\n",
    "\n",
    "By the end of our workshop today, we hope you'll understand basic syntax in Python for variables, functions, and control flow. We also hope you'll know enough about the process of web scraping and some standard packages in Python to successfully scrape information off of a basic, well-formatted web site. \n",
    "\n",
    "## Topics\n",
    "- Imports\n",
    "- Variables and types/structures (String, Int, List)\n",
    "- Functions\n",
    "- Control flow\n",
    "- Web scraping with Requests and BeautifulSoup\n",
    "- Writing text to a file\n",
    "\n",
    "## Setup and packages we need in our environment\n",
    "We'll be using Anaconda with Jupyter Notebooks for this workshop. For setting up both, please see https://github.com/sul-cidr/python_workshops/blob/master/setup.ipynb\n",
    "\n",
    "For this workshop, we'll need an environment with the following packages:\n",
    "- requests\n",
    "- beautifulsoup4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports\n",
    "- At the top of your script/file, do imports. \n",
    "- Import whole module\n",
    "- Import part of a module"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "import os\n",
    "import requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Types and variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Strings\n",
    "greeting = \"Hello, I'm Scott. It's a pleasure to meet you.\"\n",
    "# After you run this cell, note the difference between printing out in Jupyter and getting the\n",
    "# output from the last line of the cell\n",
    "print(greeting)\n",
    "greeting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Find a letter by index\n",
    "greeting[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get the length of a string\n",
    "len(greeting)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Count spaces in the string\n",
    "greeting.count(' ')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Slice to get the first 3 characters\n",
    "greeting[0:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get the last three characters\n",
    "greeting[-3:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Replace hello with goodbye\n",
    "greeting.replace(\"Hello\", \"Goodbye\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# String concatenation\n",
    "\"Hello\" + \"World\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Numbers\n",
    "# Integer and floats\n",
    "first_num = 10\n",
    "second_num = 5.467\n",
    "print(type(first_num), type(second_num))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Addition\n",
    "1 + 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Division\n",
    "10 / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Multiplication\n",
    "5 * 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Lists\n",
    "drinks = ['coffee', 'tea', 'water']\n",
    "drinks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Python allows you to create lists of different types\n",
    "mixed = [2, 'hello', 10.5, 'here is a sentence']\n",
    "mixed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Get item by index\n",
    "drinks[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Add an item to the end of the list\n",
    "drinks.append('juice')\n",
    "drinks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Splitting a string - note the type of the output\n",
    "greeting_words = greeting.split(' ')\n",
    "greeting_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Joining a list of strings \n",
    "' '.join(greeting_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are plenty of other data types and structures that we aren't going to use today, such as: sets, dictionaries, tuples, and so forth. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Functions\n",
    "\n",
    "At the most basic level, functions are chunks of reusable code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Define a function\n",
    "def add(x, y):\n",
    "    return x + y\n",
    "\n",
    "add(1, 2)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def combine_arrays(array1, array2):\n",
    "    new_list = array1 + array2\n",
    "    return new_list\n",
    "\n",
    "first = ['hello', 2]\n",
    "second = ['1', 10]\n",
    "new = combine_arrays(first, second)\n",
    "new"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;\">\n",
    "<p style=\"margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;\">\n",
    "Activity\n",
    "</p>\n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">\n",
    "In the cell below, experiment with the add function defined above. What happens if you put in two strings? A string and an integer? A list and a string?\n",
    "</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Experiment with using different and mixed variable types with add(x, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;\">\n",
    "<p style=\"margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;\">\n",
    "Activity\n",
    "</p>\n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">\n",
    "\n",
    "Pig latin is a language game where you take the first letter of a word, move it to the back of the word, then add '-ay' at the end. For example, 'pig latin' would be 'igpay atinlay' and 'python' would turn into 'ythonpay'.\n",
    "\n",
    "In the cell below, write a function that takes a string, lowercases it, and returns the pig latin translation of the word. You'll need to use slicing and string concatenation to make this work. \n",
    "</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def pig_latinize(word):\n",
    "    ...\n",
    "    return ...\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Control flow "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# IF\n",
    "name = \"Bob\"\n",
    "\n",
    "if name == \"Scott\":\n",
    "    print(\"Hi Scott!\")\n",
    "else:\n",
    "    print(\"Who are you?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# You can use control flow with functions\n",
    "# Also, you can if, else if, and else to specify more than one condition\n",
    "name = \"John\"\n",
    "\n",
    "def say_hello(name):\n",
    "    return \"Hello \" + name + \"!\"\n",
    "\n",
    "if (name == \"Bob\"):\n",
    "    message = say_hello(\"Bob\")\n",
    "    print(message)\n",
    "elif (name == \"Scott\"):\n",
    "    message = say_hello(\"Scott\")\n",
    "    print(message)\n",
    "else:\n",
    "    print(\"Who are you?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# FOR loops let you iterate over a list or other iterable object\n",
    "names = [\"Stu\", \"Scott\", \"Javier\", \"Ashley\"]\n",
    "for name in names:\n",
    "    print(name, len(name))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# You can combine types of control flow\n",
    "for name in names[:3]:\n",
    "    if len(name) > 5:\n",
    "        print(name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def add_one(num):\n",
    "    return num + 1\n",
    "\n",
    "nums = [1, 2, 3, 4]\n",
    "plus = []\n",
    "for num in nums:\n",
    "    plus.append(add_one(num))\n",
    "plus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# ADVANCED: List Comprehensions\n",
    "# List comprehensions are a \"pythonic\" way of building lists in a compact manner\n",
    "\n",
    "added = [add_one(num) for num in nums]\n",
    "added"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "long_names = [name.lower() for name in names[:3] if len(name) > 5]\n",
    "long_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;\">\n",
    "<p style=\"margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;\">\n",
    "Activity\n",
    "</p>\n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">\n",
    "In the cell below, write a function that loops over a list and returns a new list where all the strings have been replaced with their pig latin translations. \n",
    "\n",
    "For example, if your list is `['hello', 5, 'world']` your output should be `['ellohay', 5, 'orldway']`.\n",
    "\n",
    "Feel free to reuse the pig latinizer you wrote above. You'll also need to think about checking the type of each item in the list. \n",
    "</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def pig_latinize_list(items):\n",
    "    ...\n",
    "    return ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Web scraping with Requests and Beautiful Soup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scraping text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We'll use the requests library to carry out an HTTP request on the url\n",
    "# Then use BeautifulSoup to parse the HTML\n",
    "url = \"https://en.wikipedia.org/wiki/Stanford_University\"\n",
    "page = requests.get(url)\n",
    "soup = BeautifulSoup(page.text, \"html5lib\")\n",
    "soup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can use the find method to specify an HTML element to find,\n",
    "# and pass attributes such as class or id to find specific elements\n",
    "# Find only returns the first element found\n",
    "hatnote = soup.find('div', {'class': 'hatnote'})\n",
    "hatnote"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The get_text method pulls just the text from a chunk of HTML\n",
    "hat_text = hatnote.get_text()\n",
    "hat_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Within a chunk of HTML we've found, we can use find again to find another html element\n",
    "main_text_area = soup.find('div', {'class': 'mw-content-ltr'})\n",
    "main_text = main_text_area.find('p')\n",
    "main_text.get_text()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can use find_all to find every instance of an HTMl element\n",
    "# find_all returns an object we can iterate over\n",
    "paragraphs = soup.find_all('p')\n",
    "type(paragraphs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for para in paragraphs:\n",
    "    print(para.get_text())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Another text scraping example\n",
    "\n",
    "Let's create a list of urls for the chapters of A Byte of Python, iterate over the first few, and get that page content.\n",
    "\n",
    "A Byte of Python is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, allowing us to copy the book, distribute it, transmit it, remix it and so forth. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "url = \"https://python.swaroopch.com/\"\n",
    "page = requests.get(url)\n",
    "soup = BeautifulSoup(page.text, \"html5lib\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can chain together methods to find one element, then find all instances of another\n",
    "# element within that HTML block\n",
    "chapters = soup.find('nav').find_all('a')\n",
    "chapters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We can use square brackets to access the value of an attribute, such as the href of a link\n",
    "for a in chapters:\n",
    "    print(a['href'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Since the href didn't give us a full url, we use a function to build one\n",
    "def create_url(url):\n",
    "    return 'https://python.swaroopch.com/' + url\n",
    "\n",
    "# Then use a list comprehension to create a list of full urls of chapters\n",
    "chapter_links = [create_url(a['href']) for a in chapters[2:-1]]\n",
    "chapter_links"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# We've used this chunk of code several times, so let's make it a function that specifically\n",
    "# gets the text from a chapter page\n",
    "def get_page_text(url):\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.text, \"html5lib\")\n",
    "    return soup.find('section', {'class': 'markdown-section'}).get_text()\n",
    "\n",
    "for url in chapter_links:\n",
    "    print(get_page_text(url))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing text to a file\n",
    "However you prefer, create a directory/folder named 'chapters' at the same level as the file for this notebook. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# In the below functions, I've put in docstrings, which let you document the purpose of a \n",
    "# function, its parameters, and what it returns \n",
    "\n",
    "# The first function breaks apart a filename, builds a path including a directory,\n",
    "# then puts the right file extension at the end\n",
    "def create_filename(name, dirname):\n",
    "    \"\"\"\n",
    "    Builds a filename\n",
    "    \n",
    "    Args:\n",
    "        name (string) - the name of the file to be written\n",
    "        dirname (string) - the name of the directory to contain the files\n",
    "        \n",
    "    Returns:\n",
    "        filename (string) - path to the file\n",
    "    \"\"\"\n",
    "    chunks = name.split('.')\n",
    "    filename = os.path.join(dirname, chunks[0] + '.txt')\n",
    "    return filename\n",
    "\n",
    "def create_url(url):\n",
    "    \"\"\"\n",
    "    Takes a final chunk of a url and creates a full url\n",
    "    \n",
    "    Args:\n",
    "        url (string) - the url with file extension, e.g. 'dedication.html'\n",
    "        \n",
    "    Returns a full url (string)\n",
    "    \"\"\"\n",
    "    return 'https://python.swaroopch.com/' + url\n",
    "\n",
    "def get_page_text(url):\n",
    "    \"\"\"\n",
    "    Pulls html from the url, creates a beautiful soup object, and gets the text from the page\n",
    "    \n",
    "    Args:\n",
    "        url (string) - the url for the page from which you want text\n",
    "        \n",
    "    Returns the text (string) from the page\n",
    "    \"\"\"\n",
    "    page = requests.get(url)\n",
    "    soup = BeautifulSoup(page.text, \"html5lib\")\n",
    "    return soup.find('section', {'class': 'markdown-section'}).get_text()\n",
    "\n",
    "# Iterate over the chapter links, create a filename for each, get the text for each, \n",
    "# then write it to a local file in the chapters directory\n",
    "for a in chapters[2:-1]:\n",
    "    filename = create_filename(a['href'], 'chapters')\n",
    "    text = get_page_text(create_url(a['href']))\n",
    "    with open(filename, 'w') as f:\n",
    "        f.write(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;\">\n",
    "<p style=\"margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;\">\n",
    "Activity\n",
    "</p>\n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">\n",
    "One type of common use for web scraping is to gather content for analysis, such as sentiment analysis. You may have seen data-driven journalists offer sentiment analysis of political content. We won't do the analysis, but let's practice scraping news headlines off of a news site.</p> \n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">In the cell below, scrape the article titles from ProPublica's page on the current presidential administration - https://www.propublica.org/trump-administration/. You'll need to look at the html code of the page to locate the right markup to find. Think about first finding all the articles, then iterating over those to find each title. You can either just print out each title, or put it into a list. </p>\n",
    "<p style=\"margin: 0.5em 1em 0.5em 1em; padding: 0;\">Always check any given site's terms of use, content policy, and robots.txt file before scraping it. In this case, content has a Creative Commons license and the robots.txt file seems to allow a robot to hit pages for categories of articles. </p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
